Mining Web data for Chinese segmentation
Abstract
within documents as indexing terms for search of relevant documents. As Chinese is an ideographic, character-based language, the words in its texts are not delimited by white space. Indexing Chinese documents is impossible without a proper segmentation algorithm. Many Chinese segmentation algorithms have been proposed in the past, but traditional algorithms cannot operate without a large dictionary or a large corpus of training data. The Web has now become the largest corpus and is ideal for Chinese segmentation. Although most search engines have problems segmenting texts into proper words, they maintain huge databases of documents and of the frequencies of character sequences in those documents. These databases are important potential resources for segmentation. In this paper, we propose a segmentation algorithm that mines Web data with the help of search engines. In addition, the Romanized pinyin of Chinese indicates the boundaries of words in a text, and our algorithm is the first to use Romanized pinyin for segmentation. It is the first unified segmentation algorithm for the Chinese language as used in different geographical areas, and it is also domain independent because of the nature of the Web. Experiments were conducted on the datasets of a recent Chinese segmentation competition. The results show that our algorithm outperforms the traditional algorithms in terms of precision and recall. Moreover, our algorithm deals effectively with the problems of segmentation ambiguity, new word (unknown word) detection, and stop words.
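The abstract does not spell out how the search-engine frequencies are turned into word boundaries, so the following is only an illustrative sketch of the general idea of frequency-driven segmentation, not the authors' algorithm. It assumes a hypothetical table `HIT_COUNTS` of made-up counts standing in for the frequencies a search engine might report for character sequences, and it picks the split with the highest total log-score by dynamic programming. The names `HIT_COUNTS`, `segment`, and `max_len` are assumptions introduced for this example, and the pinyin evidence mentioned in the abstract is not modeled.

```python
import math

# Hypothetical, made-up counts standing in for search-engine frequencies
# of character sequences. A real system would query a search engine.
HIT_COUNTS = {
    "中国": 900, "人民": 800, "中": 500, "国": 400,
    "人": 600, "民": 100, "国人": 50,
}


def segment(text, counts, max_len=4):
    """Split text into the sequence of chunks with the highest total log-score."""
    total = sum(counts.values()) + 1

    def score(seq):
        # Add-one smoothing so unseen sequences get a low but finite score.
        return math.log((counts.get(seq, 0) + 1) / total)

    n = len(text)
    best = [0.0] + [-math.inf] * n   # best[i]: best score for text[:i]
    back = [0] * (n + 1)             # back[i]: start index of the last chunk
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            cand = best[start] + score(text[start:end])
            if cand > best[end]:
                best[end] = cand
                back[end] = start

    # Walk the back-pointers from the end to recover the chunks.
    words = []
    end = n
    while end > 0:
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1]


print(segment("中国人民", HIT_COUNTS))  # ['中国', '人民']
```

With these toy counts, the two-word split wins because both "中国" and "人民" are frequent, while the competing split through "国人" relies on a rare sequence.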
Similar resources
Geographic Information Retrieval and Text Mining on Chinese Tourism Web Pages
The World Wide Web (WWW) offers an enormous wealth of information and data, and assembles a tremendous amount of knowledge. Much of this knowledge, however, comprises either non-structured data or semi-structured data. To make use of these unexploited or underexploited resources more efficiently, the management of information and data gathering has become an essential task for research and deve...
Towards Unified Chinese Segmentation Algorithm
As Chinese is an ideographic character-based language, the words in the texts are not delimited by spaces. Indexing of Chinese documents is impossible without a proper segmentation algorithm. Many Chinese segmentation algorithms have been proposed in the past. Traditional segmentation algorithms cannot operate without a large dictionary or a large corpus of training data. Nowadays, the Web has ...
A Chinese-English Organization Name Translation System Using Heuristic Web Mining and Asymmetric Alignment
In this paper, we propose a novel system for translating organization names from Chinese to English with the assistance of web resources. First, we adopt a chunking-based segmentation method to improve the segmentation of Chinese organization names, which is plagued by the OOV problem. Then a heuristic query construction method is employed to construct an efficient query which can be used to se...
Integrating AHP and data mining for effective retailer segmentation based on retailer lifetime value
Data mining techniques have been used widely in the area of customer relationship management (CRM). In this study, we have applied data mining techniques to address a problem in business-to-business (B2B) setting. In a manufacturer-retailer-consumer chain, a manufacturer should improve its relationship with retailers to continue its business. Segmentation is a useful tool for identifying groups...
A Web-based Approach To Chinese Word Segmentation
Chinese text processing requires the detection of word boundaries. This is a non-trivial step because Chinese does not contain explicit whitespace between words. Existing word segmentation techniques make use of precompiled dictionaries and treebanks. The creation of dictionaries and treebanks is a labor-intensive process and consequently they are updated infrequently. Furthermore, due to their...
Journal title:
- JASIST
Volume 58, Issue
Pages -
Publication date 2007